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Abstract 

We present a novel approach to ray tracing execution on commodity graphics hardware using CUBA. We decom- 
pose a standard ray tracing algorithm into several data-parallel stages that are mapped efficiently to the massively 
parallel architecture of modern GPUs. These stages include: ray sorting into coherent packets, creation of frus- 
tums for packets, breadth-first frustum traversal through a bounding volume hierarchy for the scene, and localized 
ray-primitive intersections. We utilize the well known parallel primitives scan and segmented scan in order to 
process irregular data structures, to remove the need for a stack, and to minimize branch divergence in all stages. 
Our ray sorting stage is based on applying hash values to individual rays, ray stream compression, sorting and de- 
compression. Our breadth-first BVH traversal is based on parallel frustum-bounding box intersection tests and 
parallel scan per each BVH level. 

We demonstrate our algorithm with area light sources to get a soft shadow effect and show that our concept is rea- 
sonable for GPU implementation. For the same data sets and ray-primitive intersection routines our pipeline is ~3x 
faster than an optimized standard depth first ray tracing implemented in one kernel. 

Categories and Subject Descriptors (according to ACM CCS): 1.3.7 [Computer Graphics]: Three-Dimensional 
Graphics and Realism - Raytracing. 



1. Introduction 

Ray tracing [WhiSO; CPC84; ShiOO] is a flexible tool to get 
realistic visual effects such as soft shadows, glossy reflec- 
tions and global illumination. Ray tracing algorithms rely 
on a fundamental trace() method. The purpose of this 
traceO method is to find the intersection with the closest 
scene primitive for a given ray. Acceleration structures 
such as bounding volume hierarchies (BVHs) [KK86] built 
on scene geometry make the complexity of trace() loga- 
rithmic with respect to the scene size. Ray tracer execution 
consists of acceleration structure construction, sampling 
techniques, acceleration structure traversal, ray-primitive 
intersection and shading. Ray tracing is a parallel algo- 
rithm, since all the rays within the same ray generation can 
be traced independently. 

Our paper is about a faster trace{ ) method (that includes 
acceleration structure traversal and ray-primitive intersec- 
tion), and not about sampling techniques, shading or accel- 
eration structure construction [Wal07; LGS*09]. 
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Figure 1: Our ray tracing pipeline. Tracing method is 
decomposed into 4 stages: ray sorting, frustum creation, 
breadth-first traversal and localized intersection tests. 
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Modem desktop GPUs have ~1 TFlops of compute 
power and -150 Gb/sec of bandwidth (such as NVIDIA 
GTX 285). But these compute devices are best suited for 
streaming data-parallel algorithms with a local and coher- 
ent execution and memory access patterns. 

In this paper we propose a novel packet-based ray trac- 
ing pipeline where the trace() method is represented in four 
stages: ray sorting into coherent packets, creation of the 
frustums for packets, breadth-first frustum traversal 
through a bounding volume hierarchy for the scene, and 
localized ray-primitive intersections (see Fig. 1). Our first 
contribution is the ray sorting stage that is based on apply- 
ing hash values to individual rays, ray stream compression, 
sorting and decompression. Our second contribution is the 
stack-less breadth-first BVH traversal that is based on par- 
allel frustum-bounding box intersection tests and parallel 
scan per each BVH level. 

Our work was inspired by Arvo and Kirk's 5D ray trac- 
ing [AK87]. In that work a volume in 5D space (3D for 
origin and 2D for direction) is used to represent a collec- 
tion of rays. The initial 5D volume is decomposed into a 
tree of hypercubes that are linked to lists of scene primi- 
tives. Using this tree, the rays (5D points) are associated 
with a hj'percube and tested for intersection with a local list 
of primitives. In contrast, we extract the coherent packets 
of required rays; efficiently build the list of primitives per 
each packet using a 3D bounding volume hierarchy for the 
scene (and all the tracing stages are executed on a GPU). 

2. Background 

2.1 GPU Computing Model 

Modem GPUs are composed of several independent SIMD 
cores [NVIDIA; NBGS08]. Each core can execute multiple 
threads in parallel that communicate via shared on-chip 
memory. The GPU supports thousands of parallel threads 
that execute 1 simple program (kemel) for different data 
elements. The invocation of these kemels is organized in 
the host program (mnning on the CPU). GPU threads are 
subdivided into blocks (each block is executed on the sin- 
gle GPU core). Shared memory is much faster than global 
GPU memory and can be used for inter-thread communica- 
tion within a single block of threads. Each block of threads 
is organized into several warps (bundles of 32 threads) 
which execute a single kernel instmction for the entire 
warp of threads. 

SIMD/SIMT. SIMT (single instruction multiple threads) 
is a superset of SIMD where thread divergence is handled 
by hardware. This feature simplifies programming but in 
order to get high GPU compute utilization one should or- 
ganize the execution so that all the threads within a warp 
make the same branching decisions and access coherent 
memory locations. If the code of the kemel contains condi- 
tions and if some threads within a warp take different 
branches then they will execute both code paths bringing 



all the other threads of the warp with them. This case pro- 
vides multi-branching and complicated memory access 
pattern, resulting in a loss of compute efficiency. 

Parallel Scan and Segmented Scan. We utilize the well 
known parallel primitives scan and segmented scan 
[SHG08] in order to process irregular data structures. Par- 
allel scan (or prefix sum) for a given array results in the 
output array where the element is the sum of all the pre- 
vious elements of the input array (including the /* element 
for inclusive scan and excluding the i"' element for exclu- 
sive scan). Parallel segmented scan is the same as scan, but 
this procedure computes the sum of all the previous ele- 
ments within a segment of input array. The input array may 
contain arbitrary sized segments that are concatenated in a 
contiguous array. The segments of the input array are 
specified by an additional array of head flags where all the 
elements set to one denote the bases of segments; all the 
zero elements denote the tails of segments. [CUDPP] pro- 
vides a library of these parallel primitives. 

2.2 Ray Tracing Review 

Coherent Ray Tracing. Efficient ray-packet generation 
and tracing is a challenging task. Recent research has fo- 
cused significant effort on improving high coherence situa- 
tions [GPSS07; WBS07; HMOS; BW09]. These techniques 
are successful for primary rays, hard shadows or soft shad- 
ows with small area lights. This comes from grouping co- 
herent rays together, bounding them within a tight frustum 
and then performing the same traversal process for the 
group of rays. Such a technique amortizes the average 
computational and memory pattem access costs for a single 
ray. Ray packet grouping is based on the original screen- 
space layout of rays. 

Breadth-first ray tracing and ray reordering. Breadth 
first ray tracing was first investigated by Hanrahan 
[Han86]. The idea behind breadth-first ray tracing is to 
form a set of rays and intersect each BVH node against this 
set during top down traversal. The set of rays is gradually 
reordered for deeper BVH levels [GR08; ORM08] (if there 
is no ray-box intersection then the ray reference is elimi- 
nated from the set of rays that descends to the node's chil- 
dren). This algorithm amortizes the cost of node access 
pattern among the rays. A stack is used to defer intersection 
tests for neighboring nodes within a BVH. Our breadth- 
first BVH frustum traversal is based on the full parallel 
scan for all fmstums (and rays) per each BVH level and 
does not use a stack. Boulos et al. [BWB08] described a 
combination of CPU-based ray reordering techniques for 
improved coherence and effect on performance. 

GPU Ray Tracing. Most GPU ray tracers are imple- 
mented by mapping each ray to one thread. Each ray needs 
a separate hierarchy traversal stack with a depth equal to 
the maximum depth of the scene hierarchy. The stack is 
usually stored in a private thread memory (local memory in 



e 2009 The Author(s) 

Journal compilation © 2009 The Eurographics Association and Blackwell Publishing Ltd. 



Kirill Garanzha & Charles Loop / Fast Ray Sorting and Breadth-First Packet Traversal for GPU Ray Tracing 



CUDA) that is slower than shared memory since it is 
mapped to global GPU memory [NVIDIA]. Several works 
[HSHH07; GPSS07] described ways to eliminate or miti- 
gate stack usage in a GPU ray tracer. But these approaches 
have no solution for the warp-wise multi-branching prob- 
lem that was analyzed by Aila and Laine [AL09]. This 
problem was mitigated by using persistent threads that 
fetch the ray tracing task per each idle warp of threads. 
Some warps within a block of threads become idle if one 
warp executes longer than others. In our ray tracing pipe- 
line we eliminate this warp-wise multi-branching at ex- 
pense of a long pipeline and special ray sorting. Roger et 
al. [RAH07] presented a GPU ray-space hierarchy con- 
struction process based on screen-space indexing. Rays 
were not actually sorted for better coherence. 

3. GPU Ray Tracing PipeUne 

In order to map ray tracing to efficient GPU execution we 
decompose ray tracing into 4 stages: ray sorting, frustum 
creation, breadth-first traveral, and localized ray-primitive 
intersections (see Fig. I). 

Ray sorting is used to store spatially coherent rays in 
consecutive memory locations. Compared to unsorted rays, 
the tracing routine for sorted rays has less divergence on a 
wide SIMD machine such as GPU. Extracting packets of 
coherent rays enables tight frustum creation for packets of 
rays. We explicitly maintain ray coherence in our pipeline 
by using this procedure. 

We create tight frustums in order to traverse the BVH 
using only frustums instead of individual rays. For each 
frustum we build the spatially sorted list of BVH-leaves 
that are intersected by the frustum. Given that the set of 
frustums is much smaller than the set of rays, we perform 
breadth-first frustum traversal utilizing a narrower parallel 
scan per each BVH level. 

In the localized ray-primitive intersection stage, each ray 
that belongs to the frustum is tested against all the primi- 
tives contained in a list of sorted BVH-leaves captured in a 
previous stage. 

3.1 Ray Sorting 

Our ray sorting procedure is used to accelerate ray tracing 
by extracting coherence and reducing execution branches 
within a SIMD processor. However, the cost of such ray 
sorting should be offset by an increase in performance. We 
propose a technique that is based on compression of key- 
index pairs. Then we sort the compressed sequence and 
decompress the sorted data. 

Ray hash. We create the sequence of key-index pairs by 
using the ray id as index, and a hash value computed for 
this ray as the key. We quantize the ray origins assuming a 
virtual uniform 3D-grid within scene's bounding box. We 
also quantize normalized ray directions assuming a virtual 



uniform grid (see Fig. 2). We manually specify the cell 
sizes for both virtual grids (see section 5.1). With quan- 
tized components of the origin and direction we compute 
cell ids within these grids and merge them into a 32-bit 
hash value for each ray. Rays that map to the same hash 
value are considered to be coherent in the 3D-space. 




Figure 2: The quantization of ray origin and direction is 
used to compute a hash value for a given ray. 

Sorting. We introduce a "compression - sorting - de- 
compression" (CSD) scheme (see Fig. 3) and explicitly 
maintain coherence through all the ray bounce levels. Co- 
herent rays hit similar geometry locations. And these hit 
points form ray origins for next-generation rays (bounced 
rays). There is a non-zero probability that some sequen- 
tially generated rays will receive the same hash value. This 
observation is exploited and sorting becomes faster. The 
compressed ray data is sorted using radix sort [SHG09]. 



Ray IDs: 
Hash values: 



012345678910 1112 13141516171819 




Sort (radix, bitonic) 













3 9 




17 


0 


14 


4 5 


2 


3 


3 


3 


Decern Rress^^==^!^^^^SSHrEE;^E^|===i^^ 


3 4 5|6|9|l0|ll|l213 7 817|l8|l9|0|l 2 14 isfli] 


^^^^^m 



Chunk Hash: 
Chunk Base: 
Chunk Size: 

Reordered IDs: 
Hash values: 



Figure 3: The overall ray sorting scheme. 

Compression. We create the array Head Flags equal in 
size to the array Hash values. All the elements of Head 
Flags are set to 0 except for the elements whose corres- 
ponding hash value is not equal to the previous one (see 
Fig. 4). We apply an exclusive scan procedure [SHG08] to 
the Head Flags array. 
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Figure 4: Compression example. 
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We then perform data compaction into Chunk Base and 
Chunk Hash aiTays: for eacii Head FlagSj = i we write the 
value ; into position of Chunk Base array specified by 
Scan(Head Flagsjj. Analogously, we build Chunk Hash 
array. The values of Chunk Size elements are equal to dif- 
ferences between neighboring Chunk Base elements. 

Decompression. When the compressed data is sorted we 
apply an exclusive scan procedure to the Chunk Size array 
(see Fig. 5). We initialize the array Skeleton with ones, and 
the array Head Flags with zeroes (the sizes of both arrays 
are equal to Hash values array). Into positions of the array 
Skeleton specified by Scan( Chunk Size) we write the cor- 
responding values of Chunk Base array. Into positions of 
the array Head Flags specified by Scan(Chunk Size) we 
write ones. We then apply an inclusive segmented scan 
[SHG08] to array Skeleton considering the Head Flags 
array that specifies the bounds of data segments. The result 
of the segmented scan is the array of reordered (sorted) ray 
ids corresponding to their hash values. 
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SegScan(S, F): [3 | 4 | 5 | 6 | 9 |io|ii |i2|i3| 7 | 
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Reordered IDs = SegScan(S, F) 
Figure 5: Decompression example. 

Decomposition: packet ranges extraction. We would 
like to create packets of coherent rays no larger than some 
capacity (e.g., MaxSize = 256). First, we extract the base 
index and range of each cell that contains the chunk of rays 
with the same hash value. In order to do this we apply the 
compression procedure described above to the array of 
sorted rays. As a result each element of Chunk Size repre- 
sents the number of rays assigned to the coiTesponding grid 
cell (see Fig. 6). 
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Ray Packet Base = SegScan(S, F) 

Figure 6: Decomposition example. On this example each 
chunk is decomposed into the packets of MaxSize = 4. 



We create the array numPackets where numPacketSi = 
(ChunkSizCi + MaxSize - 1) / MaxSize and then scan this 
array. All the values of Skeleton are initially set to MaxSize 
and all values of Head Flags are set to zero. Into positions 
of the array Skeleton specified by Scan(numPackets) we 
write the corresponding values of array Chunk Base. Into 
positions of the array Head Flags specified by 
Scan(numPackets) we write ones. As in the decompression 
procedure, we apply an inclusive segmented scan to array 
Skeleton considering the Head Flags. The result of this 
segmented scan is the array of base indices for each ray 
packet, the size of a ray packet is found as the difference of 
consecutive bases. 

3.2 Frustum Creation 

Once the rays are sorted and packet ranges extracted, we 
build a frustum for each packet. As in the work [ORM08], 
we define the frustum by using a dominant axis and two 
axis-aligned rectangles. The dominant axis corresponds to 
the ray direction component with a maximum absolute 
value. For the coherent rays of a packet this axis is assumed 
to be the same. The two axis-aligned rectangles are 
perpendicular to this dominant axis and bound all the rays 
of the packet (see Fig. 7). 




Figure 7: Frustum is defined by dominant axis X and two 
axis-aligned rectangles. 

We implemented the frustum creation in a single CUDA 
kernel where each frustum is computed by a warp of (32) 
threads. Shared memory is used to compute the valid in- 
terval along the dominant axis and base rectangles for all 
the rays in a packet. 

3.3 Breadth-First Frustum Traversal 

We perform breadth-first frustum traversal through the 
BVH with the arity equal to eight. The binary BVH is 
constructed on the CPU and 2/3'^**' of tree levels are 
eliminated and an Octo-BVH is created (all the nodes are 
stored in a breadth-first storage layout). Each BVH-node is 
represented with 32 bytes: six float values for the axis- 
aligned bounding box (AABB), one 32-bit integer value 
represents the block of children (3 bytes for the base offset 
of the block and 1 byte for the number of children), and 
one 32-bit integer for the spatial order of children within 
this node. All the children within the node are sorted in a 
spatial 3D ascending order (see Fig. 8). 

Per frustum child ordering. For each frustum, a 3-bit 
value of F(DirSigns) is computed that corresponds to the 
sign bits of the average frustum's ray direction. The spatial 
order of node's children along the frustum direction is 
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computed using an xor-operation: SortedChildrenfi] = i 
F(DirSigns). See Fig. 8 for examples. Tliough tliis ordering 
may not be exact for all tfie cases, it is simple and sufficient 
for our purposes. 

F(DirSigns)= 11 
Sorted Children: 

F(DirSigns) = 10 ^ 
Sorted Children: 



|10|11|00|0l1 
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F(DirSigns) = 00 
Sorted Children: 



Within a node ail the 
children are stored in a 
spatial 3D ascending order 

F(DirSigns) = 01 
Sorted Children: 



Figure 8: Four examples of node's children ordering 
along the frustum are presented in a 2D projection. 

Traversal. For simplicity, we allocate a large memory 
block for the working queues (e.g. 4-8 million elements) 
used in the traversal procedure. We initialize the array 
Qin.FrustumlDs with frustum ids, and the array of 
Qin.NodelDs with the BVH root node id (i.e. with a zero). 
At each iteration step (for each level) of the traversal pro- 
cedure we perform intersection tests for each frustum cor- 
responding to Qin.FrustimilDSj with all the children of the 
BVH-node specified by Qin.NodelDsj (using AABB- 
Frustum culling). The number of intersected children is 
saved in numlntersectedj. Spatially sorted 3 -bit local offsets 
of the intersected children are packed in a single word 
SortedChildrerii (see Fig. 9). We apply an exclusive scan 
procedure to the numlntersected array. Into the chunk of 
the an'ay Qout.FrustumlDs specified by offset equal to 
Scan(nnmlntersected)i and length equal to numlntersectedj 
we propagate the value of Qin.FrustumlDs j. Into the corre- 
sponding chunk of Qout.NodelDs we write global ids of 
the children that were packed in SortedChildrenj. A global 
id for k-th child of node is computed as the sum of k and 
the base offset of node's children. Then we swap the point- 
ers Qin and Qout. This iterative procedure is repeated for 
all the levels of the BVH. The leaf-nodes of unbalanced 
tree are considered to be the children of themselves and we 
bring them to the bottom level of the tree. The number of 
levels is reduced by using Octo-BVH. 
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Figure 9: Breadth-first frustums' traversal example. 



After all traversal iterations are finished the 
Qout.FrustumlDs array contains the irregular chunks with 
the same values (frustum ids were propagated per chunk). 
The corresponding chunks of Qout.NodelDs array contain 
the references to the intersected BVH leaves that are ap- 
proximately spatially sorted along the frustums. We extract 
the array of distinct frustum ids (active frustums) and the 
corresponding ranges of leaf-chunks with a compression 
procedure applied for Qout (see section 3.1, replace Hash 
values with Qout.FrustumlDs in example Fig. 4). 

3.4 Localized Intersection Tests 

When we obtain the array of active frustum ids and corre- 
sponding ranges of leaf-chunks per frustum we decompose 
all the active frustums into chunks of 32 rays max (ray 
warps) - since the numbers of rays per frustum may not be 
equal. This decomposition is done analogously to the ex- 
ample in Fig. 6. Each ray warp is mapped to a CUDA 
thread warp execution (32 threads in a warp). All rays 
within a warp share the same frustum id that determines the 
computations and memory reads (see Fig. 10). This kind of 
execution eliminates multi-branching within a CUDA warp 
of threads and only ray masking remains. In this stage we 
exploit the sorted list of leaves per each frustum. For all 
rays in each ray warp the intersections are analyzed for the 
closest triangles first and ray distance parameters are up- 
dated. Using updated ray distance parameters we avoid 
intersection tests for occluded triangles (using ray-AABB 
test). We use CUDA persistent threads [AL09] in order to 
balance the workload since the frustums may capture ir- 
regular chunks of intersected leaves during traversal stage. 

while (ray warps are available) { // persistent 

RayWarp = f etch_next_warp ( ) ; // threads [AL09] 

Ray = f etch_ray (RayWarpBase + threadldx . x) ; 
Frustumid = frustum_id (RayWarp) ; 
for (all leaves (Frustumld) ) 

if (Ray intersects AABB (Leaf i) ) // mask rays 

for (all primitives ( Leaf i ) // coherent reads 
intersect Ray with a primitivej; 

) 

Figure 10: CUDA kernel of localized ray-primitive 
intersection tests, threadldx.x belongs to interval [0..32). 

4. Benchmark Implementation 

We test the new ray tracing pipeline for primary rays (at 
1024x768 resolution) and soft shadow rays (with area light 
source and 16 samples maximum per shade point). 

In the ray sorting stage we sort only ray origins for soft 
shadow samples (see Fig. 2). The virtual uniform grid for 
ray sorting has a user-specified size of the cell (cells are 
cubical). This size is selected as the fraction of scene's 
bounding box extents: CellSize = UserCellFraction * 
SceneBboxDiagonal. Ray directions are sorted per each 
origin according to the stratified sampling over the light 
source domain. For big area light sources the rays that hit 4 
or 16 different stratums are assigned to 4 or 16 different 
frustums (however some rays may share the same origins). 
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Within a compression-sorting-decompression (CSD) 
scheme we use a full 32-bit radix sort [SHG09]. It is possi- 
ble to apply a faster partial sorting procedure (e.g. bitonic 
sort from CUDA SDK) that locally sorts data within the 
equal chunks of compressed ray buffer. This procedure 
may extract reasonable, but not perfect ray coherence. 

Primaiy rays are indexed and sorted according to a 
screen-space Z-curve (256 rays per primary frustum). 

We use the Utah Fairy Forest (174K triangles), Confer- 
ence (280K triangles), and Sponza (68K triangles) as the 
test scenes for our algorithm (all tested viewpoints are pre- 
sented on Fig. 1 1). We build a binary BVH on the CPU for 
these models using a binning algorithm [Wal07]; we stop 
recursive construction and create a leaf when a BVH-node 
contains less than (or equal to) 10 triangles. As a ray- 
triangle intersection test we use the scheme of Moller- 
Trumbore [MT97]. 



(a) Fairy Forest (simple view) 



(b) Fairy Forest (complex view) 




;c) Conference 






(d) Sponza 
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^0 



All measurements were done using an NVIDIA GTX 
285 and all the kernels were compiled using CUDA 2.2 
[NVIDIA]. We compare the new ray tracing pipeline with 
our implementation of '"persistent speculative while-while" 
(the most efficient) ray tracing kernel described by Aila and 
Laine [AL09]. For both methods the same data-sets, BVH, 
intersection tests and viewpoints are used (for the new 
pipeline we convert the binary BVH to the Octo-BVH). 

When we measure ray tracing performance (millisec- 
onds, rays/second) for the new pipeline we take into ac- 
count only tracing-specific stages: ".sort rays", ''build frus- 
tums", "traversal", "intersection tests" (see Fig. 1). 

In our [AL09] implementation we replace three of our 
logic stages "build frustums", "traversal", "intersection 
tests" with a single "trace" kernel described in [AL09]. In 
all performance measurements we take into account only 
"trace" timings, but still this method works with sorted 
rays (for these rays CUDA thread warps should execute 
with less divergence). Ray sorting timings are not taken 
into account when we evaluate [AL09] performance. 

5. Results and Analysis 

5.1 Ray Quantization Parameters Evaluation 



UCF 


Fig.ll(a) 


Fig.ll(b) 


Fig.ll(c) 


Fig.ll(d) 






MaxSize= 


128 




0.002 


681k/ 108K/31 


432k /67k/ 31.9 


859k/ 119k/ 30 


721k/ 77k/ 30.8 


0.004 


668k /97k/ 3 1.6 


444k /66k/ 31.9 


691k /91k/ 30.9 


610k/66k/31.4 


0.008 


690k/91k/31.9 


461k /65k/ 31.9 


618k/78k/31.7 


557k/ 60k/ 31.8 


0.016 


709k/ 89k/ 31.9 


467k /65k/ 31.9 


640k/ 74k/ 31.9 


546k/ 59k/ 31.9 






MaxSize= 


256 




0.002 


521k/ 80k/ 30.5 


232k /35k/ 31.9 


945k/ 116k/ 29 


512k/ 50k/ 30.9 


0.004 


444k/ 57k/ 31.5 


233k /34k/ 31.9 


572k / 62k / 30.8 


378k/ 37k/ 31.5 


0.008 


445k /48k/ 3 1.9 


254k /33k/ 31.9 


438k /44k/ 31.6 


318k/32k/31.8 


0.016 


500k/46k/31.9 


258k /33k/ 31.9 


447k /39k/ 31.9 


303k/30k/31.9 






MaxSize= 


512 




0.002 


430k / 66k / 30.6 


130k/ 19k/ 31.8 


792k / 96k / 29.5 


494k /45k/ 30.8 


0.004 


317k/ 35k/ 31.5 


134k/ 18k/ 31.9 


431k/44k/31.0 


283k/ 25k/ 31.5 


0.008 


313k/ 26k/ 31.8 


144k/ 17k/ 31.9 


303k /26k/ 31.6 


189k/ 14k/ 31.8 


0.016 


328k/ 24k/ 31.9 


148k/ 16k/ 31.9 


294k /21k/ 31.9 


171k/ 16K/31.9 



Figure 11: Benchmark scenes and tested viewpoints. Right 
column: an example of soft shadow frustum origins. 



Table 1: Ray quantization parameters and working statistics 
for soft shadows: the number of leaves references captured 
for all frustums by traversal stage / the number of frustums / 
avgerage number of rays per ray warp (frustums are 
decomposed into ray warps in intersection stage). This data 
depends on the size of frustums; the size of each frustum 
depends on the UserCellFraction and MaxSize. A bigger 
value of UCF (UserCellFraction) denotes a bigger cubic cell 
in a virtual grid (according to which ray origins are 
quantized). MaxSize denotes the maximum capacity (in rays) 
per each frustum. This data is presented for the fixed area 
light source. Performance is given on Fig. 12. 



e 2009 The Author(s) 

Journal compilation © 2009 The Eurographics Association and Blackwell Publishing Ltd. 



Kirill Gamnzha & Charles Loop / Fast Ray Soiling and Breadth-First Packet Traversal for GPU Ray Tracing 



Traversal and intersection statistic for different ray quan- 
tization parameters are given in Table I. E.g. LS = "the 
number of leaves captured for all frustums" and FS = "the 
number of frustums" represent the working queue size of 
the breadth-first traversal. A relation LS / FS is the average 
number of leaves captured per each frustum. For all quanti- 
zation parameters (given in Table 1) a value of this relation 
is around 10. Given the fact that we build the BVH with 
-10 triangles per leaf the maximum number of triangle 
intersection tests per ray in our benchmarks should be 
-100. But all the leaves are sorted along the frustum direc- 
tion and we perform ray masking (i.e. if the AABB of the 
leaf is not intersected, see Fig. 10) so the actual number of 
ray-triangle intersection tests can be much lower. 
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Figure 12: Ray quantization parameters and performance 
in million rays / second for soft shadow rays (bigger 
numbers are better). Parameters meaning and stats are 
given in Table L 

Actual performance of ray tracing is given in Fig. 12 (for 
the viewpoints presented in Fig. 11) and it is not clear what 
parameters are the best for all scenes. However selecting 
MaxSize=256 and U serCellFraction=0.004 seems to be 
robust for high performance ray tracing and leads to rela- 
tively small working queues. We use these parameters for 
all the following measurements and comparisons. 

5.2 Ray Tracing Pipeline Stages 

Fig. 13 presents the time spent in different stages of our 
pipeline for soft shadows with a fixed light source. For the 
left chart 1024x768 elements are sorted in a ray sorting 
stage. This stage takes ~6ms for all scenes and includes 
hash value computation, compression, 32-bit radix sort, 
decompression, frustum ranges extraction. For the right 
chart 1024x768x16 elements are sorted in -40ms with a 
CSD scheme (including all the supplementary routines). In 
contrast, only the 32-bit radix sort (without CSD) for 
1024x768x16 elements takes ~80ms. 



Fig.11(a) 
Fig.11(b) 
Fig.11(c) 
Fig.11(d) 
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□ Ray Sorting 
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□ Localized Intersections 

Figure 13: Time spent in logic stages of ray tracing 
pipeline for soft shadow rays. Left chart: 16 shadow rays 
were generated per primary hit point. Right chart: 1 
shadow ray was generated per primary hit point (with 4x4 
per pixel antialiasing). For the right chart data there are 
16 shadow samples per pixel (and we sort 16x more ray 
origins overall than for the left chart data). 

5.3 Comparison with a Depth-first Ray Tracing 

The charts in Fig. 14 represent our pipeline in comparison 
to our implementation of the Aila and Laine approach 
[AL09]. The gap between two approaches is bigger for soft 
shadow rays that are less coherent since we reduce warp- 
wise branches in our ray tracing pipeline (we have only ray 
masking in intersection stage, see Fig. 10). 
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Soft Shadow rays (at 1024x768x16 samples): 
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Figure 14: Performance comparison of our ray tracing 
pipeline and our implementation of [AL09] (bigger 
numbers are better). See Fig. 1 1 for viewpoints. 

Performance measurements (rays per second) of the 
depth-first ray tracing implementation [AL09] may be dif- 
ferent from results published in this paper. We build the 
BVH with another algorithm without tessellating large 
triangles; we use different triangle intersection tests, differ- 
ent viewpoints and sampling techniques. But the input data 
and all these intersection routines are the same for our 
comparisons. Careful splitting of large triangles may pro- 
vide significant speedup for ray tracing (e.g. 2-3x [DK08; 
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Scalability: larger area light source (16 shadow samples) 
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Scalability: # of shadow rays for big area light source 
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Figure 15: Timings comparison of our pipeline and our implementation of [AL09] in a Fairy Forest (see Fig. 1 la,b) 
(smaller numbers are better). These charts represent the difference between two approaches in shadow tracing for larger 
area light sources (with the same # of rays) and for the different # of samples (with a fixed light source). 



EG07]). Our scenes contain large triangles. Low level op- 
timizations are not applied yet for our new pipeline and the 
implementation of [AL09]. 

5.4 Scalabihty 

The charts on Fig. 15 represent ray tracing timings for lar- 
ger light source (less coherent ray directions) and different 
sampling rates (less dense packets for our pipeline). The 
relative speedup of our approach compared to [AL()9] is 
increased for the viewpoint on Fig. 1 lb. This viewpoint 
was analyzed in our benchmarks since the depth of geome- 
try is highly varying in a screen-space as well as for 
shadow rays. This setup can be considered as the hard case 
for packet-based ray tracers (or ray tracers that rely on wide 
SIMD/SIMT units only). A difficult part for these tech- 
niques is that individual rays may make different decisions 
during traversal and follow different branches. But all these 
individual decisions will bring any other rays within a 
packet or warp of threads to the useless paths for them. 
This case provides multi-branching and complicated mem- 
ory access pattern that affects ray tracing performance. In 
our approach we explicitly maintain a linear execution 
without multi-branching but within a longer pipeline. 

5.5 Discussion 

The new ray tracing pipeline provides the possibility to 
trace relatively big packets of rays and perform efficient 
view-independent queries using a breadth-first frustum 
traversal. Memory access patterns for breadth-first traversal 
are coherent as we perform operations in parallel for each 
BVH level (and the BVH is stored in a breadth-first lay- 
out). The warp-wise execution and memory access pattern 



of the intersection stage are also coherent according to the 
explicit work-load organization. 

Since we store all the leaf references for all the frustums 
the memory consumption may be considerable (and we 
also store the rays). But this consumption may be reduced 
through using a screen-space tiling (send reasonably big 
tiles to render on the GPU) or even frustum depth tiling. 

A bad case for our algorithm (and for many others) 
would be if one frustum captures all the leaves of the 
scene's BVH and other frustums capture nothing during 
breadth-first traversal. This would cause a very unbalanced 
workload for the intersection stage that will be not hidden 
or amortized by persistent threads. In this case it is possible 
to replicate the ray hit information (hit distance parameter 
and primitive id) of this frustum into n instances. Then 
each of these instances perform parallel intersection tests 
with 1/n* of the leaves for this frustum. After all the inter- 
sections are tested then all results are merged by segmented 
parallel reduction with a min-operation. This reduction can 
be implemented much like segmented scan [SHG08]. 

BVH traversal. For our pipeline we convert a binary 
BVH to the Octo-BVH since this operation reduces the 
height of the tree by a factor of 2-3 and reduces the number 
of calls for the scan operation. Octo-BVH also increases 
the number of AABB-frustum culling tests and texture 
fetches per each traversal thread (8 node children are tested 
per thread). Overall, the breadth-first traversal stage with 
the Octo-BVH is 2x faster than with a binary BVH. 

We actually implemented the depth-first traversal stage 
for created frustums (where each frustum selects K leaves 
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and the intersections are performed in another stage). But 
depth-first traversal was 5x slower than a breadth-first one. 

Comparison to CPU ray tracer. We also implemented 
a CPU ray tracer with our packet assembling approach 
(sorting only origins of shadow samples). These origins 
were sorted using a binning approach for grids assigned to 
screen-space tiles. We employed CPU-friendly algorithms: 
tile-based multi-core parallelism, depth-first traversal, 
SIMD instructions, advanced triangle culling techniques 
and other state of the art CPU optimizations that are very 
similar to the approaches described in [BW09]. Although 
no details are presented on the charts, our new ray tracing 
pipeline running on a GTX 285 is ~4x faster than a CPU- 
friendly implementation running on a Core 2 Quad 2.4GHz 
Q6600. We used almost the same scene setup to compare 
both versions (the same scenes and BVHs, but different 
viewpoints and execution routines). 

6. Conclusion and Future Worli 

In this paper we have presented a novel ray tracing pipeline 
using CUDA. This pipeline consists of multiple data- 
parallel stages where the warp-wise multi-branching is 
eliminated. We rely on the ray sorting procedure and ex- 
plicitly maintain coherent execution within all the stages of 
our pipeline. We have proposed a simple compression- 
sorting-decompression (CSD) technique in order to accel- 
erate the ray sorting stage. We have proposed a fast stack- 
less breadth-first frustum traversal algorithm that supports 
view-independent queries by using a full parallel scan of 
each BVH level. 

An advantage of our work is that it is a software pipeline 
that runs on existing GPU (this means flexibility). We have 
also shown that the ray tracing executed in our pipeline is 
faster than current state of the art GPU or CPU ray tracers. 
Though our results and implementation are still prelimi- 
nary, we expect performance improvement for Whitted ray 
tracing and path tracing. A complete solution should take a 
large number of rays (e.g. a set of arbitrary rays from a 
single ray-generation level) at input and provide intersec- 
tion information at output (for shading stage). 

We are going to extend the application of our CSD tech- 
niques. First, we will implement a faster GPU-based accel- 
eration structure (BVH) builder using the Z-curve index- 
ing, sorting, compression, BVH-construction [LGS*09] 
and decompression. Second, we are going to accelerate the 
stream reordering for deferred shading [HLJHOQ]. 

Finally, we are going to integrate a Reyes-style adaptive 
smooth surface subdivision [EML09] into our ray tracing 
pipeline that will rely on frustum-patch subdivision crite- 
rion. Assuming the view-independent ray tracing queries, 
this feature should increase the visual quality of geometry 
appearance. 
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